deterministic equivalence
The φCurve: The Shape of Generalization through the Lens of Norm-based Capacity Control
Understanding how the test risk scales with model complexity is a central question in machine learning. Classical theory is challenged by the learning curves observed for large over-parametrized deep networks. Capacity measures based on parameter count typically fail to account for these empirical observations. To tackle this challenge, we consider norm-based capacity measures and develop our study for random features based estimators, widely used as simplified theoretical models for more complex networks. In this context, we provide a precise characterization of how the estimator's norm concentrates and how it governs the associated test error. Our results show that the predicted learning curve admits a phase transition from under- to over-parameterization, but no double descent behavior. This confirms that the more classical U-shaped behavior is recovered when considering appropriate capacity measures based on model norms rather than size. From a technical point of view, we leverage deterministic equivalence as the key tool and further develop new deterministic quantities which are of independent interest.
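The relationship between estimator norm and test error described above can be explored numerically. The following is a minimal sketch, not the paper's setup: it assumes a hypothetical linear teacher, isotropic Gaussian inputs, ReLU random features, and a ridge penalty scaled as n·λ, and it simply records the pair (norm, test error) for several widths p. The function name `rf_ridge_curve` and all parameter values are illustrative choices.

```python
import numpy as np

def rf_ridge_curve(widths, n=100, d=20, lam=1e-3, n_test=1000, seed=0):
    """Return a list of (p, ||theta||_2, test MSE) for random features ridge regression."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d) / np.sqrt(d)      # hypothetical linear teacher (assumption)
    X = rng.standard_normal((n, d))
    X_test = rng.standard_normal((n_test, d))
    y = X @ beta + 0.1 * rng.standard_normal(n)     # noisy training labels
    y_test = X_test @ beta                          # noise-free test targets
    curve = []
    for p in widths:
        W = rng.standard_normal((d, p)) / np.sqrt(d)           # frozen random first layer
        Z = np.maximum(X @ W, 0.0) / np.sqrt(p)                # ReLU random features
        Z_test = np.maximum(X_test @ W, 0.0) / np.sqrt(p)
        # Ridge estimator: theta = (Z^T Z + n*lam*I)^{-1} Z^T y
        theta = np.linalg.solve(Z.T @ Z + n * lam * np.eye(p), Z.T @ y)
        norm = float(np.linalg.norm(theta))                    # norm-based capacity
        test_mse = float(np.mean((Z_test @ theta - y_test) ** 2))
        curve.append((p, norm, test_mse))
    return curve

curve = rf_ridge_curve([10, 50, 100, 200, 400])
for p, norm, mse in curve:
    print(f"p={p:4d}  ||theta||_2={norm:8.3f}  test MSE={mse:.4f}")
```

Plotting test MSE against the norm column instead of against p gives the norm-indexed view of the learning curve; the shape obtained depends on the assumed data model and regularization.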
A Random Matrix Theory Perspective on the Consistency of Diffusion Models
Wang, Binxu, Zavatone-Veth, Jacob, Pehlevan, Cengiz
Diffusion models trained on different, non-overlapping subsets of a dataset often produce strikingly similar outputs when given the same noise seed. We trace this consistency to a simple linear effect: the shared Gaussian statistics across splits already predict much of the generated images. To formalize this, we develop a random matrix theory (RMT) framework that quantifies how finite datasets shape the expectation and variance of the learned denoiser and sampling map in the linear setting. For expectations, sampling variability acts as a renormalization of the noise level through a self-consistent relation $σ^2 \mapsto κ(σ^2)$, explaining why limited data overshrink low-variance directions and pull samples toward the dataset mean. For fluctuations, our variance formulas reveal three key factors behind cross-split disagreement: \textit{anisotropy} across eigenmodes, \textit{inhomogeneity} across inputs, and overall scaling with dataset size. Extending deterministic-equivalence tools to fractional matrix powers further allows us to analyze entire sampling trajectories. The theory sharply predicts the behavior of linear diffusion models, and we validate its predictions on UNet and DiT architectures in their non-memorization regime, identifying where and how samples deviate across training data splits. This provides a principled baseline for reproducibility in diffusion training, linking spectral properties of data to the stability of generative outputs.
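For orientation, the "linear setting" can be illustrated with the standard Gaussian case; this is a textbook fact rather than a formula quoted from the abstract. Assume data $x_0 \sim \mathcal{N}(\mu, \Sigma)$ with $\Sigma = \sum_k \lambda_k u_k u_k^\top$, and assume the additive noising convention $x = x_0 + \sigma \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$. The minimum mean-squared-error denoiser is then affine and shrinks each eigenmode separately:

```latex
D^\star(x;\sigma) = \mu + \Sigma\,(\Sigma + \sigma^2 I)^{-1}(x - \mu),
\qquad
u_k^\top \big( D^\star(x;\sigma) - \mu \big)
  = \frac{\lambda_k}{\lambda_k + \sigma^2}\; u_k^\top (x - \mu).
```

Read against this formula, replacing $\sigma^2$ by a larger effective value $κ(σ^2)$ makes the factor $\lambda_k / (\lambda_k + \sigma^2)$ smaller, and the effect is strongest for modes with small $\lambda_k$. The precise form of $κ$ is not given in the abstract and is not reconstructed here.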
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Atanasov, Alexander, Bordelon, Blake, Zavatone-Veth, Jacob A., Paquette, Courtney, Pehlevan, Cengiz
Modern deep learning practice is governed by the surprising predictability of performance improvement with increases in the scale of data, model size, and compute [17]. Often, the scaling of performance as a function of these quantities exhibits remarkably regular power law behavior, termed a neural scaling law [2, 6, 12, 13, 15, 16, 18, 19, 22, 32]. Here, performance is usually measured by some differentiable loss on the predictions of the model on a held-out test set representative of the population. Given the relatively universal behavior of the exponents across architectures and optimizers [11, 18, 19], one might hope that relatively simple models of information processing systems can recover the same types of scaling laws. The (stochastic) gradient descent (SGD) dynamics in random feature models were analyzed in recent works [7, 20, 26]; these dynamics exhibit a surprising breadth of scaling behavior and capture several interesting phenomena in deep network training. Each of the above works has isolated various effects that can hurt performance compared to the idealized infinite data and infinite model size limits. The model was first studied in [7], where the bottlenecks due to finite width and finite dataset size were computed and, for certain data structures, resulted in a Chinchilla-type scaling result as in [18].
Re-examining Double Descent and Scaling Laws under Norm-based Capacity via Deterministic Equivalence
Wang, Yichen, Chen, Yudong, Rosasco, Lorenzo, Liu, Fanghui
The number of parameters, i.e., model size, provides a basic measure of the capacity of a machine learning (ML) model. However, it is well known that it might not describe the effective model capacity (Bartlett, 1998), especially for over-parameterized neural networks (Belkin et al., 2018; Zhang et al., 2021) and large language models (Brown et al., 2020). The focus on the number of parameters results in an inaccurate characterization of the relationship between the test risk R, training data size n, and model size p, which is central in ML to understand the bias-variance trade-off (Vapnik, 1995), double descent (Belkin et al., 2019) and scaling laws (Kaplan et al., 2020; Xiao, 2024). For example, even for the same architecture (model size), the test error behavior can be totally different (Nakkiran et al., 2020, 2021), e.g., double descent may disappear. Here we shift the focus from model size to weights and consider their norm, a perspective pioneered in the classical results in Bartlett (1998). Indeed, norm-based capacity/complexity measures are widely considered to be more effective in characterizing generalization behavior, see e.g.
Risk and cross validation in ridge regression with correlated samples
Atanasov, Alexander, Zavatone-Veth, Jacob A., Pehlevan, Cengiz
Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging recent techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross validation estimator (GCV) fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high-dimensional data.
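As a point of reference for the estimator being modified, here is a minimal sketch of the classical GCV for ridge regression with independent samples. It is not CorrGCV, whose correction terms are not stated in the abstract. The penalty scaling n·λ, the helper name `ridge_gcv`, and the synthetic Gaussian data are assumptions made for illustration.

```python
import numpy as np

def ridge_gcv(X, y, lam):
    """Return (w_hat, train_mse, gcv) for ridge regression with penalty n * lam."""
    n, p = X.shape
    A = X.T @ X + n * lam * np.eye(p)
    # Ridge solution w = (X^T X + n*lam*I)^{-1} X^T y
    w_hat = np.linalg.solve(A, X.T @ y)
    # Effective degrees of freedom: trace of the smoother S = X A^{-1} X^T
    df = np.trace(np.linalg.solve(A, X.T @ X))
    train_mse = float(np.mean((y - X @ w_hat) ** 2))
    # Classical GCV: in-sample error inflated by (1 - df/n)^{-2}
    gcv = train_mse / (1.0 - df / n) ** 2
    return w_hat, train_mse, gcv

# Synthetic i.i.d. example (all values are illustrative)
rng = np.random.default_rng(0)
n, p, lam, noise_std = 200, 50, 0.1, 0.5
X = rng.standard_normal((n, p))
w_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ w_star + noise_std * rng.standard_normal(n)

w_hat, train_mse, gcv = ridge_gcv(X, y, lam)
# Exact out-of-sample risk for a fresh x ~ N(0, I) with independent label noise
test_risk = float(np.sum((w_hat - w_star) ** 2) + noise_std ** 2)
print(f"train MSE={train_mse:.4f}  GCV={gcv:.4f}  test risk={test_risk:.4f}")
```

With independent rows, the GCV value is expected to track the out-of-sample risk; the abstract's claim is that this agreement breaks down once the samples are correlated, which this sketch does not simulate.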